Text Zone Classification using Unsupervised Feature Learning
Internal identifier: 000006 (France/Analysis); previous: 000005; next: 000007
Authors: Nibal Nayef [France]; Jean-Marc Ogier [France]
Abstract
Text zone classification is a vital step in the digitization process, without which OCR systems perform poorly. Prior approaches to document zone classification have relied on large sets of hand-crafted features for training zone classifiers. Such features are usually database-dependent, and their computation is time-consuming. In this work we propose a novel method for text zone classification that relies on unsupervised feature learning. Within our method, feature vectors of document zones are learned automatically by patch extraction, encoding and pooling, where feature encoding is based on a codebook of visual words. The training phase of the text classifier takes into consideration the imbalance between text zones and non-text zones of all types. The proposed method has been tested on publicly available standard databases, and achieved competitive or better results compared to state-of-the-art methods. The results show that our approach is well suited to the task of text classification, and is robust to zone shapes, orientations and sizes.
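The abstract names the stages of the feature pipeline (patch extraction, encoding against a codebook of visual words, pooling) and mentions class imbalance, but gives no parameters. The following is only a minimal sketch of that general technique, not the authors' implementation. Every specific here is an assumption made for illustration: the function names, the 8x8 patch size, the stride of 4, the use of plain k-means to build the codebook, hard nearest-word assignment, normalized-histogram pooling, and inverse-frequency class weights.

```python
import numpy as np


def extract_patches(zone, size=8, stride=4):
    """Slide a size x size window over a 2-D grayscale zone (assumed to be
    at least size x size) and return the flattened patches, one per row."""
    h, w = zone.shape
    patches = [
        zone[r:r + size, c:c + size].ravel()
        for r in range(0, h - size + 1, stride)
        for c in range(0, w - size + 1, stride)
    ]
    return np.asarray(patches, dtype=np.float64)


def nearest_word(patches, codebook):
    """Index of the closest codebook entry (squared Euclidean) per patch."""
    dists = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


def learn_codebook(patches, k=16, iters=10, seed=0):
    """Plain k-means (Lloyd's algorithm); the k centroids serve as the
    'visual words'. k-means is an assumption, not taken from the source."""
    rng = np.random.default_rng(seed)
    centroids = patches[rng.choice(len(patches), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = nearest_word(patches, centroids)
        for j in range(k):
            members = patches[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids


def encode_and_pool(zone, codebook, size=8, stride=4):
    """Hard-assign each patch to a visual word, then pool the assignments
    into a normalized histogram with one bin per word."""
    labels = nearest_word(extract_patches(zone, size, stride), codebook)
    hist = np.bincount(labels, minlength=len(codebook)).astype(np.float64)
    return hist / hist.sum()


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    return {int(c): len(labels) / (len(classes) * n)
            for c, n in zip(classes, counts)}


# Hypothetical data: two random "zones" of different sizes stand in for
# real document zones; label 1 = text, 0 = non-text.
rng = np.random.default_rng(0)
zones = [rng.random((32, 48)), rng.random((40, 24))]
codebook = learn_codebook(np.vstack([extract_patches(z) for z in zones]))
features = [encode_and_pool(z, codebook) for z in zones]
weights = balanced_class_weights([1, 0, 0, 0])
```

In this sketch both zones yield a feature vector of the same length (one bin per visual word) regardless of their dimensions, which is one plausible way a pooled representation can be insensitive to zone size. The weights could be passed to any classifier that accepts per-class weights; the abstract does not say which classifier or weighting scheme the authors used.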
Links toward previous steps (curation, corpus...)
- to stream Hal, to step Corpus: 000118
- to stream Hal, to step Curation: 000118
- to stream Hal, to step Checkpoint: 000006
- to stream Main, to step Merge: 000018
- to stream Main, to step Curation: 000018
- to stream Main, to step Exploration: 000018
- to stream France, to step Extraction: 000006
Links to Exploration step
Hal:hal-01319899
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">Text Zone Classification using Unsupervised Feature Learning</title>
<author><name sortKey="Nayef, Nibal" sort="Nayef, Nibal" uniqKey="Nayef N" first="Nibal" last="Nayef">Nibal Nayef</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-40831" status="VALID"><orgName>Laboratoire Informatique, Image et Interaction</orgName>
<orgName type="acronym">L3I</orgName>
<desc><address><addrLine>Bâtiment Pascal Avenue Michel Crépeau F-17042 La Rochelle Cedex 1</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-lr.fr/l3i</ref>
</desc>
<listRelation><relation name="EA2118" active="#struct-300311" type="direct"></relation>
</listRelation>
<tutelles><tutelle name="EA2118" active="#struct-300311" type="direct"><org type="institution" xml:id="struct-300311" status="VALID"><orgName>Université de La Rochelle</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName><settlement type="city">La Rochelle</settlement>
<region type="region" nuts="2">Poitou-Charentes</region>
</placeName>
<orgName type="university">Université de La Rochelle</orgName>
</affiliation>
</author>
<author><name sortKey="Ogier, Jean Marc" sort="Ogier, Jean Marc" uniqKey="Ogier J" first="Jean-Marc" last="Ogier">Jean-Marc Ogier</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-40831" status="VALID"><orgName>Laboratoire Informatique, Image et Interaction</orgName>
<orgName type="acronym">L3I</orgName>
<desc><address><addrLine>Bâtiment Pascal Avenue Michel Crépeau F-17042 La Rochelle Cedex 1</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-lr.fr/l3i</ref>
</desc>
<listRelation><relation name="EA2118" active="#struct-300311" type="direct"></relation>
</listRelation>
<tutelles><tutelle name="EA2118" active="#struct-300311" type="direct"><org type="institution" xml:id="struct-300311" status="VALID"><orgName>Université de La Rochelle</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName><settlement type="city">La Rochelle</settlement>
<region type="region" nuts="2">Poitou-Charentes</region>
</placeName>
<orgName type="university">Université de La Rochelle</orgName>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:hal-01319899</idno>
<idno type="halId">hal-01319899</idno>
<idno type="halUri">https://hal.archives-ouvertes.fr/hal-01319899</idno>
<idno type="url">https://hal.archives-ouvertes.fr/hal-01319899</idno>
<date when="2015-08-23">2015-08-23</date>
<idno type="wicri:Area/Hal/Corpus">000118</idno>
<idno type="wicri:Area/Hal/Curation">000118</idno>
<idno type="wicri:Area/Hal/Checkpoint">000006</idno>
<idno type="wicri:Area/Main/Merge">000018</idno>
<idno type="wicri:Area/Main/Curation">000018</idno>
<idno type="wicri:Area/Main/Exploration">000018</idno>
<idno type="wicri:Area/France/Extraction">000006</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en">Text Zone Classification using Unsupervised Feature Learning</title>
<author><name sortKey="Nayef, Nibal" sort="Nayef, Nibal" uniqKey="Nayef N" first="Nibal" last="Nayef">Nibal Nayef</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-40831" status="VALID"><orgName>Laboratoire Informatique, Image et Interaction</orgName>
<orgName type="acronym">L3I</orgName>
<desc><address><addrLine>Bâtiment Pascal Avenue Michel Crépeau F-17042 La Rochelle Cedex 1</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-lr.fr/l3i</ref>
</desc>
<listRelation><relation name="EA2118" active="#struct-300311" type="direct"></relation>
</listRelation>
<tutelles><tutelle name="EA2118" active="#struct-300311" type="direct"><org type="institution" xml:id="struct-300311" status="VALID"><orgName>Université de La Rochelle</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName><settlement type="city">La Rochelle</settlement>
<region type="region" nuts="2">Poitou-Charentes</region>
</placeName>
<orgName type="university">Université de La Rochelle</orgName>
</affiliation>
</author>
<author><name sortKey="Ogier, Jean Marc" sort="Ogier, Jean Marc" uniqKey="Ogier J" first="Jean-Marc" last="Ogier">Jean-Marc Ogier</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-40831" status="VALID"><orgName>Laboratoire Informatique, Image et Interaction</orgName>
<orgName type="acronym">L3I</orgName>
<desc><address><addrLine>Bâtiment Pascal Avenue Michel Crépeau F-17042 La Rochelle Cedex 1</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-lr.fr/l3i</ref>
</desc>
<listRelation><relation name="EA2118" active="#struct-300311" type="direct"></relation>
</listRelation>
<tutelles><tutelle name="EA2118" active="#struct-300311" type="direct"><org type="institution" xml:id="struct-300311" status="VALID"><orgName>Université de La Rochelle</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName><settlement type="city">La Rochelle</settlement>
<region type="region" nuts="2">Poitou-Charentes</region>
</placeName>
<orgName type="university">Université de La Rochelle</orgName>
</affiliation>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Text zone classification is a vital step in the digitization process, without which OCR systems perform poorly. Prior approaches to document zone classification have relied on large sets of hand-crafted features for training zone classifiers. Such features are usually database-dependent, and their computation is time-consuming. In this work we propose a novel method for text zone classification that relies on unsupervised feature learning. Within our method, feature vectors of document zones are learned automatically by patch extraction, encoding and pooling, where feature encoding is based on a codebook of visual words. The training phase of the text classifier takes into consideration the imbalance between text zones and non-text zones of all types. The proposed method has been tested on publicly available standard databases, and achieved competitive or better results compared to state-of-the-art methods. The results show that our approach is well suited to the task of text classification, and is robust to zone shapes, orientations and sizes.</div>
</front>
</TEI>
<affiliations><list><country><li>France</li>
</country>
<region><li>Poitou-Charentes</li>
</region>
<settlement><li>La Rochelle</li>
</settlement>
<orgName><li>Université de La Rochelle</li>
</orgName>
</list>
<tree><country name="France"><region name="Poitou-Charentes"><name sortKey="Nayef, Nibal" sort="Nayef, Nibal" uniqKey="Nayef N" first="Nibal" last="Nayef">Nibal Nayef</name>
</region>
<name sortKey="Ogier, Jean Marc" sort="Ogier, Jean Marc" uniqKey="Ogier J" first="Jean-Marc" last="Ogier">Jean-Marc Ogier</name>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/France/Analysis
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000006 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/France/Analysis/biblio.hfd -nk 000006 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= France |étape= Analysis |type= RBID |clé= Hal:hal-01319899 |texte= Text Zone Classification using Unsupervised Feature Learning }}
This area was generated with Dilib version V0.6.32.